The HVT package is a collection of R functions that facilitates building topology-preserving maps for rich multivariate data analysis, aimed at datasets tending towards big data, i.e., a large number of rows. The functions for this typical workflow are organized below:
Data Compression: Vector quantization (VQ) and HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows of a long data frame according to a compression objective.
Data Projection: Projection of the compressed cells to 1D, 2D, or an interactive surface plot using Sammon's non-linear mapping algorithm. This step produces coordinates for a topology-preserving map (also called an embedding) in the desired output dimension.
Tessellation: Creation of the cells required for visualization using the Voronoi tessellation method; the package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology-preserving map, and is useful for semi-supervised tasks.
Scoring: Scoring new datasets and recording their cell assignments using the map objects from the steps above, in a sequence of maps if required.
In this section, we will see how to use the package to visualize multidimensional data by projecting it to two dimensions with Sammon's projection, and then use the resulting maps for scoring.
Data Understanding
First of all, let us see how to generate data for a torus. We use the geozoo library for this purpose. Geo Zoo (short for Geometric Zoo) is a compilation of geometric objects ranging from three to ten dimensions. Geo Zoo contains regular or well-known objects, e.g., the cube and sphere, and some abstract objects, e.g., Boy's surface, the torus, and the hyper-torus.
Here, we will generate a 3D torus (a torus is a surface of revolution generated by revolving a circle in three-dimensional space one full revolution about an axis that is coplanar with the circle) with 12000 points.
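The same surface can also be sampled by hand from the torus parametric equations; below is a minimal sketch (not the geozoo implementation), with illustrative radii R = 2 and r = 1 chosen to match the value ranges seen in the dataset. Note that sampling the two angles uniformly does not give a perfectly uniform density on the surface.

```r
# Sample a 3D torus from its parametric equations.
# R (major radius) and r (minor radius) are illustrative choices.
set.seed(240)
n <- 12000
R <- 2; r <- 1
theta <- runif(n, 0, 2 * pi)  # angle around the central axis
phi   <- runif(n, 0, 2 * pi)  # angle around the tube
torus_manual <- data.frame(
  x = (R + r * cos(phi)) * cos(theta),
  y = (R + r * cos(phi)) * sin(theta),
  z = r * sin(phi)
)
```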
Raw Torus Dataset
The torus dataset includes the columns x, y, and z.
Let's explore the torus dataset containing 12000 points. For the sake of brevity we display the first 6 rows.
set.seed(240)
# Here p represents dimension of object, n represents number of points
torus <- geozoo::torus(p = 3,n = 12000)
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x","y","z")
torus_df <- torus_df %>% round(4)
Table(head(torus_df), scroll = FALSE)
| x | y | z |
|---|---|---|
| -2.6282 | 0.5656 | -0.7253 |
| -1.4179 | -0.8903 | 0.9455 |
| -1.0308 | 1.1066 | -0.8731 |
| 1.8847 | 0.1895 | 0.9944 |
| -1.9506 | -2.2507 | 0.2071 |
| -1.4824 | 0.9229 | 0.9672 |
Now let's have a look at the structure and summary of the torus dataset.
str(torus_df)
#> 'data.frame': 12000 obs. of 3 variables:
#> $ x: num -2.63 -1.42 -1.03 1.88 -1.95 ...
#> $ y: num 0.566 -0.89 1.107 0.19 -2.251 ...
#> $ z: num -0.725 0.946 -0.873 0.994 0.207 ...
data_table <- summary_eda(torus_df)
Table(data_table, scroll = TRUE)
| variable | min | 1st Quartile | median | mean | sd | 3rd Quartile | max | hist | n_row | n_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| x | -2.9977 | -1.1490 | -0.0070 | -0.0014 | 1.5060 | 1.1403 | 2.9995 | ▅▇▇▇▅ | 12000 | 0 |
| y | -2.9993 | -1.1133 | 0.0130 | 0.0103 | 1.4856 | 1.1337 | 2.9993 | ▃▇▇▇▅ | 12000 | 0 |
| z | -1.0000 | -0.7120 | 0.0153 | 0.0044 | 0.7118 | 0.7186 | 1.0000 | ▇▃▃▃▇ | 12000 | 0 |
Now, let us check the data distribution of the torus dataset.
Variable Histograms
Shown below is the distribution of all the variables in the torus dataset.
eda_cols <- names(torus_df)
dist_list <- lapply(seq_along(eda_cols), function(i){
generateDistributionPlot(torus_df, eda_cols[i]) })
do.call(gridExtra::grid.arrange, args = list(grobs = dist_list, ncol = 2, top = "Distribution of Features"))
Box Plots
In this section, we plot box plots for each numeric column in the torus dataset across panels. These plots display the median and interquartile range of each column at a panel level.
box_plots <- list()
for (x in names(torus_df)) {
box_plots[[x]] <- quantile_outlier_plots_fn(data = torus_df, outlier_check_var = x)[[1]]}
gridExtra::grid.arrange(grobs = box_plots, ncol = 3)
Correlation Matrix
In this section we calculate the Pearson correlation, a bivariate measure of the linear correlation between two numeric columns. The output is shown as a matrix.
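A minimal sketch of that computation with base R's cor(); the if-guard simply builds a small stand-in torus so the snippet also runs on its own, outside the vignette flow.

```r
# Pearson correlation matrix for the torus features x, y, z
if (!exists("torus_df")) {            # stand-in data so the snippet runs alone
  theta <- runif(500, 0, 2 * pi)
  phi   <- runif(500, 0, 2 * pi)
  torus_df <- data.frame(x = (2 + cos(phi)) * cos(theta),
                         y = (2 + cos(phi)) * sin(theta),
                         z = sin(phi))
}
cor_mat <- cor(torus_df, method = "pearson")
round(cor_mat, 4)
```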
Train - Test Split
Let us split the torus dataset into train and test. We will randomly select 80% of the torus dataset as train and remaining as test.
smp_size <- floor(0.80 * nrow(torus_df))
set.seed(279)
train_ind <- sample(seq_len(nrow(torus_df)), size = smp_size)
torus_train <- torus_df[train_ind, ]
torus_test <- torus_df[-train_ind, ]
Training Dataset
Now, let's have a look at the selected training dataset containing 9600 data points. For the sake of brevity we display the first six rows.
rownames(torus_train) <- NULL
Table(head(torus_train), scroll = FALSE)
| x | y | z |
|---|---|---|
| 1.7958 | -0.4204 | -0.9878 |
| 0.7115 | -2.3528 | -0.8889 |
| 1.9285 | 1.2034 | 0.9620 |
| 1.0175 | 0.0344 | -0.1894 |
| -0.2736 | 1.1298 | -0.5464 |
| 1.8976 | 2.2391 | 0.3545 |
Now let's have a look at the structure and summary of the training dataset.
str(torus_train)
#> 'data.frame': 9600 obs. of 3 variables:
#> $ x: num 1.796 0.712 1.929 1.018 -0.274 ...
#> $ y: num -0.4204 -2.3528 1.2034 0.0344 1.1298 ...
#> $ z: num -0.988 -0.889 0.962 -0.189 -0.546 ...
train_table <- summary_eda(torus_train)
Table(train_table, scroll = TRUE)
| variable | min | 1st Quartile | median | mean | sd | 3rd Quartile | max | hist | n_row | n_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| x | -2.9973 | -1.1514 | -0.0102 | -0.0055 | 1.5057 | 1.1254 | 2.9995 | ▅▇▇▇▅ | 9600 | 0 |
| y | -2.9993 | -1.1078 | 0.0209 | 0.0163 | 1.4832 | 1.1377 | 2.9993 | ▃▇▇▇▅ | 9600 | 0 |
| z | -1.0000 | -0.7067 | 0.0147 | 0.0046 | 0.7100 | 0.7168 | 1.0000 | ▇▃▃▃▇ | 9600 | 0 |
Now, let's check the data distribution of the training dataset.
Variable Histograms
Shown below is the distribution of all the variables in the training dataset.
eda_cols <- names(torus_train)
dist_list <- lapply(seq_along(eda_cols), function(i){
generateDistributionPlot(torus_train, eda_cols[i]) })
do.call(gridExtra::grid.arrange, args = list(grobs = dist_list, ncol = 2, top = "Distribution of Features"))
Testing Dataset
Now, let's have a look at the testing dataset containing 2400 data points. For the sake of brevity we display the first six rows.
rownames(torus_test) <- NULL
Table(head(torus_test), scroll = FALSE)
| x | y | z |
|---|---|---|
| -2.6282 | 0.5656 | -0.7253 |
| 2.7471 | -0.9987 | -0.3848 |
| -2.4446 | -1.6528 | 0.3097 |
| -2.6487 | -0.5745 | 0.7040 |
| -0.2676 | -1.0800 | -0.4611 |
| -1.1130 | -0.6516 | -0.7040 |
Now let's have a look at the structure and summary of the testing dataset.
str(torus_test)
#> 'data.frame': 2400 obs. of 3 variables:
#> $ x: num -2.628 2.747 -2.445 -2.649 -0.268 ...
#> $ y: num 0.566 -0.999 -1.653 -0.575 -1.08 ...
#> $ z: num -0.725 -0.385 0.31 0.704 -0.461 ...
test_table <- summary_eda(torus_test)
Table(test_table, scroll = TRUE)
| variable | min | 1st Quartile | median | mean | sd | 3rd Quartile | max | hist | n_row | n_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| x | -2.9977 | -1.1310 | 0.0001 | 0.0147 | 1.5072 | 1.1934 | 2.9908 | ▅▇▇▇▅ | 2400 | 0 |
| y | -2.9918 | -1.1314 | -0.0001 | -0.0133 | 1.4951 | 1.1118 | 2.9861 | ▃▇▇▇▅ | 2400 | 0 |
| z | -1.0000 | -0.7337 | 0.0157 | 0.0036 | 0.7192 | 0.7311 | 1.0000 | ▇▃▃▃▇ | 2400 | 0 |
Now, let's check the data distribution of the testing dataset.
Variable Histograms
Shown below is the distribution of all the variables in the testing dataset.
eda_cols <- names(torus_test)
dist_list <- lapply(1:length(eda_cols), function(i){
generateDistributionPlot(torus_test, eda_cols[i]) })
do.call(gridExtra::grid.arrange, args = list(grobs = dist_list, ncol = 2, top = "Distribution of Features"))Let us try to visualize the compressed Map A from the diagram below.
Figure 1: Data Segregation with highlighted bounding box in red around compressed map A
This package can perform vector quantization using the k-means and k-medoids algorithms.
For more information on vector quantization, refer to the following link.
The trainHVT function constructs highly compressed hierarchical Voronoi tessellations. The raw data is first scaled, and this scaled data is supplied as input to the vector quantization algorithm. The algorithm compresses the dataset until a user-defined compression percentage is achieved, using the quantization error as a threshold: for a given user-defined compression percentage we get 'n' cells, and all of these cells have a quantization error below the threshold quantization error.
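To make the quantization-error idea concrete, here is a small self-contained illustration (not the trainHVT internals): it quantizes random 3-D points with stats::kmeans and computes the "max" error metric per cell.

```r
# Quantize 300 random 3-D points into 10 cells, then compute each cell's
# quantization error as the max point-to-centroid distance (L2 here).
set.seed(1)
pts <- matrix(rnorm(300 * 3), ncol = 3)
km  <- kmeans(pts, centers = 10, nstart = 5)

qerr <- sapply(seq_len(nrow(km$centers)), function(k) {
  member <- pts[km$cluster == k, , drop = FALSE]
  devs   <- sweep(member, 2, km$centers[k, ])  # point minus centroid
  max(sqrt(rowSums(devs ^ 2)))                 # "max" error metric
})
# A compression check would now count the cells with qerr below the threshold.
```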
Let’s try to comprehend the trainHVT function first before moving ahead.
trainHVT(
dataset,
min_compression_perc,
n_cells,
depth,
quant.err,
distance_metric = c("L1_Norm", "L2_Norm"),
error_metric = c("mean", "max"),
quant_method = c("kmeans", "kmedoids"),
normalize = TRUE,
diagnose = FALSE,
hvt_validation = FALSE,
train_validation_split_ratio = 0.8
)
Each of the parameters of the trainHVT function is explained below:
dataset - A data frame with numeric columns (features) that will be used for training the model.
min_compression_perc - An integer indicating the minimum compression percentage to be achieved for the dataset, i.e., the desired reduction in dataset size compared to its original size.
n_cells - An integer indicating the number of cells per hierarchy (level). This parameter determines the granularity, or level of detail, of the hierarchical vector quantization.
depth - An integer indicating the number of levels. A depth of 1 means no hierarchy (a single level), while higher values indicate multiple levels (a hierarchy).
quant.err - A number indicating the quantization error threshold. A cell will only break down into further cells if its quantization error is above the defined threshold.
projection.scale - A number indicating the scale factor for the tessellations, so that the sub-tessellations can be visualized well enough. It helps adjust the visual representation of the hierarchy to make the sub-tessellations more visible.
scale_summary - A list with mean and standard deviation values for all the features in the dataset. Pass the scale summary when the input dataset is already scaled or normalize is set to FALSE.
distance_metric - The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean); L1_Norm is selected by default. The distance metric is used to calculate the distance between an n-dimensional point and a centroid.
error_metric - The error metric can be mean or max; max is selected by default. max returns the maximum of the m values and mean takes their mean, where each value is the distance between a point and the centroid of its cell.
quant_method - The quantization method can be kmeans or kmedoids. k-means uses means (centroids) as cluster centers while k-medoids uses actual data points (medoids) as cluster centers. kmeans is selected by default.
normalize - A logical value indicating whether the dataset should be normalized. When set to TRUE, the values of all features are scaled to have a mean of 0 and a standard deviation of 1 (Z-score).
diagnose - A logical value indicating whether to perform diagnostics on the model. The default is FALSE.
hvt_validation - A logical value indicating whether to hold out a validation set and find the mean absolute deviation of the validation points from their centroids. These values can be found in the 6th element of the result list, under model.info, with the title validation_result. The default is FALSE.
train_validation_split_ratio - A numeric value indicating the train/validation split ratio. This argument is only used when hvt_validation is set to TRUE. The default is 0.8.
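As a quick illustration of the two distance metrics described above (with made-up coordinates):

```r
# Distance between a 3-D point and a centroid under both metrics
point    <- c(1.2, -0.5, 0.3)
centroid <- c(1.0,  0.0, 0.0)

l1 <- sum(abs(point - centroid))       # L1_Norm (Manhattan), ~1.0
l2 <- sqrt(sum((point - centroid)^2))  # L2_Norm (Euclidean), ~0.616
```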
The output of the trainHVT function (a list of 6 elements) is explained below:
The 1st element is a list containing information related to plotting tessellations. This information might include coordinates, boundaries, or other details necessary for visualizing the tessellations.
The 2nd element is a list containing Sammon's projection coordinates of the data points in the reduced-dimensional space.
The 3rd element is a list containing detailed information about the hierarchical vector quantized data, along with a summary section containing the number of points, the quantization error, and the centroid for each cell.
The 4th element is a list containing all the diagnostics information of the model when diagnose is set to TRUE; otherwise NA.
The 5th element is a list containing all the information required to generate a Mean Absolute Deviation (MAD) plot, if hvt_validation is set to TRUE; otherwise NA.
The 6th element (model info) is a list containing the model-generation timestamp, the input parameters passed to the model, and the validation results.
We will use the trainHVT function to compress our data while preserving the essential features of the dataset. Our goal is to achieve data compression of at least 80%. In situations where the compression ratio does not meet the desired target, we can adjust the model parameters, such as the quantization error threshold or the number of cells, and then rerun the trainHVT function. As this is already covered in the HVT vignette, please refer to it for more information.
Model Parameters
set.seed(240)
torus_mapA <- trainHVT(
torus_train,
n_cells = 900,
depth = 1,
quant.err = 0.1,
projection.scale = 10,
normalize = FALSE,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans"
)
Let's check the compression summary for the torus.
compressionSummaryTable(torus_mapA[[3]]$compression_summary)
| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 900 | 749 | 0.83 | n_cells: 900 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
With the n_cells parameter set to 900, 83% of the cells fall below the quantization error threshold, meeting our compression target. The next step involves performing data projection on the compressed data. In this step, the compressed data will be projected onto a lower-dimensional space to visualize and analyze it in a more manageable form.
As per the manual, torus_mapA[[3]] gives us detailed information about the hierarchical vector quantized data, and torus_mapA[[3]][['summary']] gives a tabular summary containing the number of points, the quantization error, and the codebook.
The datatable displayed below is the summary from torus_mapA, showing Cell.ID, centroids, and quantization error for each of the 900 cells. For the sake of brevity, we display only the first 100 rows.
summaryTable(torus_mapA[[3]]$summary,scroll = TRUE,limit = 100)
| Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | x | y | z |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 11 | 82 | 0.07 | -0.99 | -2.14 | 0.93 |
| 1 | 1 | 2 | 8 | 164 | 0.07 | 1.91 | -1.80 | 0.78 |
| 1 | 1 | 3 | 11 | 392 | 0.05 | 1.03 | -0.88 | -0.76 |
| 1 | 1 | 4 | 6 | 126 | 0.04 | 0.21 | -2.08 | 0.99 |
| 1 | 1 | 5 | 6 | 786 | 0.05 | 0.53 | 1.91 | -1.00 |
| 1 | 1 | 6 | 12 | 831 | 0.08 | 1.70 | 1.77 | 0.89 |
| 1 | 1 | 7 | 10 | 460 | 0.06 | -0.96 | 0.30 | 0.11 |
| 1 | 1 | 8 | 7 | 399 | 0.05 | 0.72 | -0.69 | 0.04 |
| 1 | 1 | 9 | 14 | 661 | 0.09 | 2.01 | 0.31 | -1.00 |
| 1 | 1 | 10 | 7 | 169 | 0.07 | -1.53 | -1.31 | -1.00 |
| 1 | 1 | 11 | 8 | 544 | 0.06 | -1.29 | 0.93 | -0.91 |
| 1 | 1 | 12 | 7 | 48 | 0.07 | -1.84 | -2.04 | -0.67 |
| 1 | 1 | 13 | 10 | 698 | 0.07 | 0.60 | 1.40 | 0.88 |
| 1 | 1 | 14 | 10 | 3 | 0.12 | -1.33 | -2.64 | 0.25 |
| 1 | 1 | 15 | 13 | 416 | 0.05 | -1.05 | -0.01 | 0.32 |
| 1 | 1 | 16 | 7 | 442 | 0.06 | 2.41 | -0.73 | 0.85 |
| 1 | 1 | 17 | 10 | 704 | 0.08 | 0.22 | 1.46 | -0.85 |
| 1 | 1 | 18 | 7 | 238 | 0.05 | -1.54 | -0.81 | -0.96 |
| 1 | 1 | 19 | 14 | 320 | 0.06 | -0.56 | -0.83 | 0.02 |
| 1 | 1 | 20 | 6 | 306 | 0.06 | -0.90 | -0.77 | -0.58 |
| 1 | 1 | 21 | 11 | 124 | 0.12 | 2.21 | -2.00 | -0.15 |
| 1 | 1 | 22 | 8 | 526 | 0.05 | 0.98 | 0.23 | -0.05 |
| 1 | 1 | 23 | 5 | 877 | 0.11 | 0.71 | 2.84 | 0.36 |
| 1 | 1 | 24 | 5 | 853 | 0.05 | 1.15 | 2.31 | -0.81 |
| 1 | 1 | 25 | 10 | 550 | 0.08 | 2.05 | -0.13 | -0.99 |
| 1 | 1 | 26 | 7 | 892 | 0.11 | 2.34 | 1.86 | -0.09 |
| 1 | 1 | 27 | 5 | 682 | 0.07 | -2.47 | 1.56 | 0.37 |
| 1 | 1 | 28 | 11 | 752 | 0.07 | -0.26 | 1.88 | -0.99 |
| 1 | 1 | 29 | 10 | 12 | 0.1 | -0.40 | -2.94 | 0.23 |
| 1 | 1 | 30 | 11 | 308 | 0.09 | -2.99 | 0.24 | -0.04 |
| 1 | 1 | 31 | 14 | 734 | 0.09 | -1.99 | 1.90 | 0.65 |
| 1 | 1 | 32 | 8 | 652 | 0.08 | -0.43 | 1.38 | 0.84 |
| 1 | 1 | 33 | 12 | 688 | 0.08 | -1.06 | 1.66 | 1.00 |
| 1 | 1 | 34 | 5 | 795 | 0.05 | -0.61 | 2.35 | -0.90 |
| 1 | 1 | 35 | 8 | 282 | 0.06 | -0.65 | -0.92 | 0.49 |
| 1 | 1 | 36 | 9 | 44 | 0.08 | 0.08 | -2.79 | -0.61 |
| 1 | 1 | 37 | 10 | 558 | 0.08 | -0.72 | 0.96 | -0.61 |
| 1 | 1 | 38 | 4 | 85 | 0.04 | 0.35 | -2.48 | 0.86 |
| 1 | 1 | 39 | 13 | 673 | 0.06 | 1.07 | 0.98 | -0.83 |
| 1 | 1 | 40 | 5 | 609 | 0.04 | 1.17 | 0.61 | 0.73 |
| 1 | 1 | 41 | 6 | 743 | 0.08 | -2.32 | 1.89 | 0.05 |
| 1 | 1 | 42 | 12 | 779 | 0.11 | 1.96 | 1.14 | 0.96 |
| 1 | 1 | 43 | 10 | 726 | 0.06 | -0.07 | 1.74 | 0.97 |
| 1 | 1 | 44 | 6 | 65 | 0.05 | 0.54 | -2.66 | 0.70 |
| 1 | 1 | 45 | 15 | 833 | 0.1 | 1.81 | 1.66 | -0.88 |
| 1 | 1 | 46 | 11 | 589 | 0.07 | 0.68 | 0.77 | 0.25 |
| 1 | 1 | 47 | 12 | 240 | 0.1 | -1.60 | -0.67 | 0.96 |
| 1 | 1 | 48 | 14 | 705 | 0.07 | 1.74 | 0.76 | 0.99 |
| 1 | 1 | 49 | 8 | 557 | 0.05 | 0.86 | 0.52 | -0.11 |
| 1 | 1 | 50 | 10 | 230 | 0.05 | -1.93 | -0.53 | 1.00 |
| 1 | 1 | 51 | 9 | 227 | 0.06 | -0.23 | -1.35 | 0.78 |
| 1 | 1 | 52 | 12 | 127 | 0.1 | -1.89 | -1.27 | 0.96 |
| 1 | 1 | 53 | 15 | 464 | 0.07 | -1.00 | 0.35 | 0.32 |
| 1 | 1 | 54 | 10 | 865 | 0.09 | 1.23 | 2.42 | 0.69 |
| 1 | 1 | 55 | 13 | 121 | 0.08 | -2.27 | -1.07 | 0.86 |
| 1 | 1 | 56 | 8 | 357 | 0.08 | -2.06 | 0.09 | 1.00 |
| 1 | 1 | 57 | 7 | 769 | 0.06 | -0.75 | 2.15 | -0.96 |
| 1 | 1 | 58 | 13 | 425 | 0.1 | -2.69 | 0.55 | 0.66 |
| 1 | 1 | 59 | 11 | 556 | 0.04 | 0.85 | 0.53 | 0.09 |
| 1 | 1 | 60 | 10 | 100 | 0.13 | -2.52 | -1.05 | 0.67 |
| 1 | 1 | 61 | 10 | 811 | 0.1 | -0.87 | 2.58 | -0.68 |
| 1 | 1 | 62 | 8 | 135 | 0.05 | 0.56 | -2.08 | 0.99 |
| 1 | 1 | 63 | 11 | 580 | 0.08 | -0.61 | 1.09 | 0.65 |
| 1 | 1 | 64 | 7 | 610 | 0.06 | -1.75 | 1.25 | -0.99 |
| 1 | 1 | 65 | 8 | 86 | 0.09 | -0.46 | -2.34 | -0.92 |
| 1 | 1 | 66 | 12 | 250 | 0.08 | -1.97 | -0.45 | -1.00 |
| 1 | 1 | 67 | 9 | 368 | 0.06 | -1.45 | -0.20 | -0.84 |
| 1 | 1 | 68 | 11 | 163 | 0.09 | 1.19 | -2.08 | -0.91 |
| 1 | 1 | 69 | 7 | 448 | 0.1 | -2.90 | 0.67 | 0.15 |
| 1 | 1 | 70 | 12 | 615 | 0.12 | 1.47 | 0.43 | 0.88 |
| 1 | 1 | 71 | 8 | 280 | 0.05 | -0.34 | -1.09 | -0.51 |
| 1 | 1 | 72 | 14 | 420 | 0.08 | 0.95 | -0.66 | -0.54 |
| 1 | 1 | 73 | 14 | 336 | 0.07 | -0.77 | -0.68 | 0.23 |
| 1 | 1 | 74 | 20 | 402 | 0.1 | 1.39 | -0.74 | 0.90 |
| 1 | 1 | 75 | 7 | 611 | 0.05 | 0.24 | 1.02 | 0.31 |
| 1 | 1 | 76 | 17 | 438 | 0.06 | -1.04 | 0.15 | 0.31 |
| 1 | 1 | 77 | 14 | 645 | 0.05 | 0.28 | 1.13 | -0.55 |
| 1 | 1 | 78 | 10 | 751 | 0.09 | 2.87 | 0.29 | -0.45 |
| 1 | 1 | 79 | 8 | 157 | 0.09 | 1.47 | -1.93 | 0.90 |
| 1 | 1 | 80 | 7 | 875 | 0.05 | 1.26 | 2.49 | -0.61 |
| 1 | 1 | 81 | 16 | 575 | 0.06 | -0.33 | 0.97 | -0.23 |
| 1 | 1 | 82 | 5 | 84 | 0.05 | 1.81 | -2.36 | 0.18 |
| 1 | 1 | 83 | 20 | 572 | 0.08 | -0.45 | 1.01 | 0.45 |
| 1 | 1 | 84 | 8 | 640 | 0.06 | 0.01 | 1.24 | 0.65 |
| 1 | 1 | 85 | 15 | 112 | 0.13 | 1.85 | -2.19 | -0.49 |
| 1 | 1 | 86 | 12 | 224 | 0.07 | -0.33 | -1.42 | -0.84 |
| 1 | 1 | 87 | 14 | 35 | 0.12 | -1.04 | -2.56 | -0.63 |
| 1 | 1 | 88 | 14 | 476 | 0.08 | -1.06 | 0.47 | 0.54 |
| 1 | 1 | 89 | 13 | 574 | 0.08 | -0.30 | 0.98 | 0.22 |
| 1 | 1 | 90 | 14 | 47 | 0.1 | 0.18 | -2.76 | 0.63 |
| 1 | 1 | 91 | 10 | 517 | 0.05 | 1.01 | 0.15 | 0.19 |
| 1 | 1 | 92 | 17 | 71 | 0.13 | -1.33 | -2.15 | -0.84 |
| 1 | 1 | 93 | 18 | 286 | 0.08 | -2.06 | -0.24 | -0.99 |
| 1 | 1 | 94 | 6 | 894 | 0.06 | 2.08 | 2.08 | -0.32 |
| 1 | 1 | 95 | 8 | 807 | 0.13 | -1.11 | 2.58 | 0.58 |
| 1 | 1 | 96 | 11 | 29 | 0.11 | -0.11 | -2.87 | -0.47 |
| 1 | 1 | 97 | 4 | 719 | 0.04 | -2.03 | 1.81 | -0.69 |
| 1 | 1 | 98 | 9 | 889 | 0.11 | 2.21 | 1.95 | 0.30 |
| 1 | 1 | 99 | 7 | 22 | 0.07 | 0.06 | -2.99 | 0.02 |
| 1 | 1 | 100 | 6 | 186 | 0.05 | -0.53 | -1.52 | 0.92 |
Now let us understand what each column in the above table means:
Segment.Level - Level of the cell. In this case, we performed vector quantization with depth 1, hence the segment level is 1.
Segment.Parent - Parent segment of the cell.
Segment.Child (Cell.Number) - The children of a particular cell. In this case, it is the total number of cells at which we achieved the defined compression percentage.
n - Number of points in each cell.
Cell.ID - Cell IDs are generated for the multivariate data using the 1-D Sammon's projection algorithm.
Quant.Error - Quantization error for each cell.
All the columns after this contain the centroids for each cell. Together they can be called a codebook, which represents the collection of all centroids or codewords.
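A short sketch of pulling the codebook out of such a summary; the two mock rows below merely stand in for torus_mapA[[3]]$summary, using values from the table above.

```r
# Mock of the first two summary rows shown above
hvt_summary <- data.frame(
  Segment.Level = c(1, 1), Segment.Parent = c(1, 1), Segment.Child = c(1, 2),
  n = c(11, 8), Cell.ID = c(82, 164), Quant.Error = c(0.07, 0.07),
  x = c(-0.99, 1.91), y = c(-2.14, -1.80), z = c(0.93, 0.78)
)
# The centroid columns form the codebook: one codeword (row) per cell
codebook <- hvt_summary[, c("x", "y", "z")]
```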
Now let's try to understand the plotHVT function. Its parameters are explained in detail below:
plotHVT(hvt.results, line.width, color.vec, pch1 = 21, palette.color = 6, centroid.size = 1.5, title = NULL, maxDepth = NULL, dataset, child.level, hmap.cols, previous_level_heatmap = TRUE, show.points = FALSE, asp = 1, ask = TRUE, tess.label = NULL, quant.error.hmap = NULL, n_cells.hmap = NULL, label.size = 0.5, sepration_width = 7, layer_opacity = c(0.5, 0.75, 0.99), dim_size = 1000, heatmap = '2Dhvt')
hvt.results - (2Dhvt/2Dheatmap/surface_plot) A list obtained from the trainHVT function while performing hierarchical vector quantization on the training dataset. This list provides an overview of the hierarchical vector quantized data, including diagnostics, tessellation details, Sammon's projection coordinates, and model input information.
line.width - (2Dhvt/2Dheatmap) A vector indicating the line widths of the tessellation boundaries for each layer.
color.vec - (2Dhvt/2Dheatmap) A vector indicating the colors of the tessellation boundaries at each layer.
pch1 - (2Dhvt/2Dheatmap) Symbol used to plot the centroids, such as a solid circle, bullet, filled square, or filled diamond (default = 21, i.e., filled circle).
centroid.size - (2Dhvt/2Dheatmap) Size of the centroids at each level of tessellation (default = 3).
title - (2Dhvt) A title for the plot (default = NULL).
maxDepth - (2Dhvt) An integer indicating the number of levels (default = NULL).
palette.color - (2Dheatmap) A number indicating the heat map color palette: 1 - rainbow, 2 - heat.colors, 3 - terrain.colors, 4 - topo.colors, 5 - cm.colors, 6 - BlCyGrYlRd (Blue, Cyan, Green, Yellow, Red) (default = 6).
dataset - (2Dheatmap) A data frame: the input dataset.
child.level - (2Dheatmap/surface_plot) A number indicating the level for which the heat map is to be plotted.
hmap.cols - (2Dheatmap/surface_plot) A number or character giving the column number or column name from the dataset indicating the variable for which the heat map is to be plotted.
previous_level_heatmap - (2Dheatmap) A logical value indicating whether the heatmap of the previous level should be overlaid on the heatmap of the selected level.
show.points - (2Dheatmap) A logical value indicating whether the centroids should be plotted on the tessellations (default = FALSE).
asp - (2Dheatmap) A number indicating the aspect ratio type. For a flexible aspect ratio, set asp = NA (default = 1).
ask - (2Dheatmap) A logical value for an interactive R session (default = TRUE).
tess.label - (2Dheatmap) A vector for labelling the tessellations (default = NULL).
label.size - (2Dheatmap) The size by which the tessellation labels should be scaled (default = 0.5).
quant.error.hmap - (2Dheatmap) A number indicating the quantization error threshold.
n_cells.hmap - (2Dheatmap) An integer indicating the number of cells/clusters per hierarchy.
sepration_width - (surface_plot) An integer indicating the width between two levels.
layer_opacity - (surface_plot) A vector indicating the opacity of each layer/level.
dim_size - (surface_plot) An integer indicating the dimension size used to create the matrix for the plot.
heatmap - A character indicating which type of plot should be generated. Accepted entries are '1D', '2Dhvt', '2Dheatmap', and 'surface_plot'. The default is '2Dhvt'.
Let’s plot the Voronoi tessellation for layer 1 (map A).
plotHVT(torus_mapA,
line.width = c(0.4),
color.vec = c("#141B41"),
centroid.size = 0.01,
maxDepth = 1,
heatmap = '2Dhvt')
Figure 2: The Voronoi Tessellation for layer 1 (map A) shown for the 900 cells in the dataset 'torus'
Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the torus dataset for better visualization and interpretation of data patterns and distributions.
The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus dataset, allowing us to observe patterns and trends in the distribution of each of the features (x, y, z). The green shades highlight regions with higher values in each heatmap, while the indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insights into the variations and relationships between these features within the torus dataset.
plotHVT(
torus_mapA,
torus_train,
child.level = 1,
hmap.cols = "x",
line.width = c(0.2),
color.vec = c("#141B41"),
palette.color = 6,
centroid.size = 0.1,
show.points = TRUE,
quant.error.hmap = 0.2,
n_cells.hmap = 900,
heatmap = '2Dheatmap'
)
Figure 4: The Voronoi Tessellation with the heat map overlaid for variable 'x' in the 'torus' dataset
plotHVT(
torus_mapA,
torus_train,
child.level = 1,
hmap.cols = "y",
line.width = c(0.2),
color.vec = c("#141B41"),
palette.color = 6,
centroid.size = 0.1,
show.points = TRUE,
quant.error.hmap = 0.2,
n_cells.hmap = 900,
heatmap = '2Dheatmap'
)
Figure 5: The Voronoi Tessellation with the heat map overlaid for variable 'y' in the 'torus' dataset
plotHVT(
torus_mapA,
torus_train,
child.level = 1,
hmap.cols = "z",
line.width = c(0.2),
color.vec = c("#141B41"),
palette.color = 6,
centroid.size = 0.1,
show.points = TRUE,
quant.error.hmap = 0.2,
n_cells.hmap = 440,
heatmap = '2Dheatmap'
)
Figure 6: The Voronoi Tessellation with the heat map overlaid for variable 'z' in the 'torus' dataset
Let us try to visualize the Map B from the diagram below.
Figure 10: Data Segregation with highlighted bounding box in red around map B
In this section, we will manually identify the novelty cells from the plotted torus_mapA and store them in the identified_Novelty_cells variable.
Note: For manually selecting the novelty cells from map A, one can enhance interactivity by adding plotly elements to the code. This transforms map A into an interactive plot, allowing users to actively engage with the data. By hovering over the centroid of a cell, a tag containing the segment child information is displayed. Users can explore the map by hovering over different cells and selectively choose the novelty cells they wish to consider. An image is added below for reference.
Figure 11: Manually selecting novelty cells
The removeNovelty function removes the identified novelty cell(s) from the training dataset (containing 9600 data points) and stores those records separately. It takes as input the cell numbers (Segment.Child) of the manually identified novelty cell(s) and the compressed HVT map (torus_mapA) with 900 cells. It returns a list of two items: the data with novelty, and the data without novelty.
NOTE: As we are using the torus dataset here, the identified novelty cells are given for demo purposes.
identified_Novelty_cells <- c(595,425,165,875,822,697,166) # as an example
output_list <- removeNovelty(identified_Novelty_cells, torus_mapA)
data_with_novelty <- output_list[[1]]
data_without_novelty <- output_list[[2]]
Let's have a look at the data with novelty (containing 83 records).
novelty_data <- data_with_novelty
novelty_data$Row.No <- row.names(novelty_data)
novelty_data <- novelty_data %>% dplyr::select("Row.No","Cell.ID","Cell.Number","x","y","z")
colnames(novelty_data) <- c("Row.No","Cell.ID","Segment.Child","x","y","z")
Table(novelty_data,scroll = TRUE, limit = 100)| Row.No | Cell.ID | Segment.Child | x | y | z |
|---|---|---|---|---|---|
| 1 | 64 | 165 | 1.3324 | -2.6588 | -0.2268 |
| 2 | 64 | 165 | 1.5139 | -2.5820 | 0.1175 |
| 3 | 64 | 165 | 1.4286 | -2.6364 | -0.0536 |
| 4 | 64 | 165 | 1.4241 | -2.6117 | -0.2235 |
| 5 | 64 | 165 | 1.4838 | -2.6035 | 0.0825 |
| 6 | 64 | 165 | 1.6572 | -2.5004 | -0.0235 |
| 7 | 64 | 165 | 1.3774 | -2.6605 | 0.0900 |
| 8 | 64 | 165 | 1.3506 | -2.6740 | -0.0925 |
| 9 | 64 | 165 | 1.4585 | -2.5935 | -0.2201 |
| 10 | 64 | 165 | 1.4487 | -2.6202 | -0.1093 |
| 11 | 64 | 165 | 1.3985 | -2.6531 | -0.0424 |
| 12 | 64 | 165 | 1.4575 | -2.5965 | -0.2104 |
| 13 | 64 | 165 | 1.5355 | -2.5772 | 0.0102 |
| 14 | 64 | 165 | 1.5240 | -2.5769 | -0.1107 |
| 15 | 819 | 166 | -1.4562 | 2.5939 | -0.2235 |
| 16 | 819 | 166 | -1.4989 | 2.5819 | -0.1701 |
| 17 | 819 | 166 | -1.3234 | 2.6788 | -0.1553 |
| 18 | 819 | 166 | -1.3064 | 2.6859 | -0.1623 |
| 19 | 819 | 166 | -1.4328 | 2.6352 | 0.0303 |
| 20 | 819 | 166 | -1.4572 | 2.6197 | 0.0670 |
| 21 | 819 | 166 | -1.6413 | 2.5055 | -0.0980 |
| 22 | 819 | 166 | -1.5168 | 2.5865 | 0.0552 |
| 23 | 819 | 166 | -1.5143 | 2.5889 | -0.0375 |
| 24 | 819 | 166 | -1.3334 | 2.6861 | -0.0478 |
| 25 | 819 | 166 | -1.3388 | 2.6804 | -0.0876 |
| 26 | 819 | 166 | -1.4489 | 2.6259 | -0.0421 |
| 27 | 839 | 425 | 2.8302 | 0.9714 | 0.1239 |
| 28 | 839 | 425 | 2.7954 | 1.0744 | 0.1024 |
| 29 | 839 | 425 | 2.7172 | 1.2425 | 0.1556 |
| 30 | 839 | 425 | 2.7929 | 1.0780 | -0.1123 |
| 31 | 839 | 425 | 2.8059 | 1.0436 | 0.1121 |
| 32 | 839 | 425 | 2.8309 | 0.9756 | -0.1071 |
| 33 | 839 | 425 | 2.7778 | 1.1314 | 0.0350 |
| 34 | 839 | 425 | 2.7531 | 1.1907 | 0.0281 |
| 35 | 839 | 425 | 2.8049 | 1.0642 | -0.0066 |
| 36 | 839 | 425 | 2.8666 | 0.8810 | 0.0472 |
| 37 | 839 | 425 | 2.7736 | 1.1191 | 0.1346 |
| 38 | 839 | 425 | 2.8352 | 0.9279 | 0.1828 |
| 39 | 839 | 425 | 2.7399 | 1.1841 | 0.1734 |
| 40 | 839 | 425 | 2.8198 | 1.0163 | 0.0732 |
| 41 | 839 | 425 | 2.8123 | 1.0309 | 0.0965 |
| 42 | 800 | 595 | 2.9378 | 0.6073 | -0.0134 |
| 43 | 800 | 595 | 2.9338 | 0.5885 | 0.1242 |
| 44 | 800 | 595 | 2.9436 | 0.5597 | 0.0852 |
| 45 | 800 | 595 | 2.8827 | 0.7950 | 0.1391 |
| 46 | 800 | 595 | 2.9308 | 0.5937 | 0.1390 |
| 47 | 800 | 595 | 2.9066 | 0.7312 | 0.0755 |
| 48 | 800 | 595 | 2.9099 | 0.5923 | 0.2447 |
| 49 | 800 | 595 | 2.9383 | 0.6049 | 0.0126 |
| 50 | 800 | 595 | 2.9290 | 0.6330 | -0.0821 |
| 51 | 800 | 595 | 2.9056 | 0.5665 | 0.2788 |
| 52 | 800 | 595 | 2.9156 | 0.7037 | -0.0376 |
| 53 | 39 | 697 | -2.5955 | -1.4823 | 0.1483 |
| 54 | 39 | 697 | -2.6052 | -1.4771 | 0.1019 |
| 55 | 39 | 697 | -2.5256 | -1.6157 | 0.0609 |
| 56 | 39 | 697 | -2.6147 | -1.4575 | 0.1138 |
| 57 | 39 | 697 | -2.5697 | -1.5461 | 0.0466 |
| 58 | 39 | 697 | -2.5655 | -1.5213 | 0.1855 |
| 59 | 39 | 697 | -2.5258 | -1.5891 | 0.1775 |
| 60 | 39 | 697 | -2.4889 | -1.6570 | 0.1411 |
| 61 | 39 | 697 | -2.5673 | -1.5518 | -0.0157 |
| 62 | 79 | 822 | -2.7984 | -1.0772 | 0.0543 |
| 63 | 79 | 822 | -2.7636 | -1.1210 | -0.1873 |
| 64 | 79 | 822 | -2.7651 | -1.1357 | -0.1463 |
| 65 | 79 | 822 | -2.7255 | -1.2424 | -0.0963 |
| 66 | 79 | 822 | -2.8309 | -0.9764 | 0.1042 |
| 67 | 79 | 822 | -2.8204 | -0.9948 | -0.1359 |
| 68 | 79 | 822 | -2.7876 | -1.1062 | 0.0432 |
| 69 | 79 | 822 | -2.7586 | -1.1768 | -0.0423 |
| 70 | 79 | 822 | -2.7917 | -1.0899 | -0.0787 |
| 71 | 36 | 875 | 0.7839 | -2.8918 | 0.0879 |
| 72 | 36 | 875 | 0.4498 | -2.9486 | 0.1852 |
| 73 | 36 | 875 | 0.5570 | -2.9287 | 0.1932 |
| 74 | 36 | 875 | 0.6597 | -2.9266 | 0.0029 |
| 75 | 36 | 875 | 0.7650 | -2.9007 | 0.0126 |
| 76 | 36 | 875 | 0.6418 | -2.9220 | 0.1290 |
| 77 | 36 | 875 | 0.6082 | -2.9195 | 0.1879 |
| 78 | 36 | 875 | 0.6415 | -2.8840 | 0.2982 |
| 79 | 36 | 875 | 0.4248 | -2.9574 | 0.1560 |
| 80 | 36 | 875 | 0.6271 | -2.9079 | 0.2233 |
| 81 | 36 | 875 | 0.4684 | -2.9614 | 0.0604 |
| 82 | 36 | 875 | 0.6822 | -2.8930 | 0.2337 |
| 83 | 36 | 875 | 0.5300 | -2.9477 | 0.1004 |
The plotNovelCells function plots the Voronoi tessellation using the compressed HVT map (torus_mapA) containing 900 cells and highlights the identified novelty cells, i.e., the 7 cells (containing 83 records), in red on the map.
plotNovelCells(identified_Novelty_cells, torus_mapA, line.width = c(0.4), centroid.size = 0.01)
Figure 12: The Voronoi Tessellation constructed using the compressed HVT map (map A) with the novelty cell(s) highlighted in red
We pass the dataframe with the novelty records (83 records) to the trainHVT function, along with the model parameters mentioned below, to generate map B (layer 2).
Model Parameters
colnames(data_with_novelty) <- c("Cell.ID","Segment.Child","x","y","z")
data_with_novelty <- data_with_novelty[,-1:-2]
torus_mapB <- list()
mapA_scale_summary = torus_mapA[[3]]$scale_summary
torus_mapB <- trainHVT(data_with_novelty,
n_cells = 11,
depth = 1,
quant.err = 0.1,
projection.scale = 10,
normalize = FALSE,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans"
)
The datatable displayed below is the summary from map B (layer 2), showing the Cell.ID, centroids, and quantization error for each of the 11 cells.
summaryTable(torus_mapB[[3]]$summary, scroll = TRUE)
| Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | x | y | z |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 8 | 9 | 0.08 | 0.68 | -2.91 | 0.15 |
| 1 | 1 | 2 | 9 | 3 | 0.11 | -2.78 | -1.10 | -0.05 |
| 1 | 1 | 3 | 5 | 5 | 0.05 | 2.75 | 1.17 | 0.11 |
| 1 | 1 | 4 | 7 | 1 | 0.1 | -1.50 | 2.59 | -0.03 |
| 1 | 1 | 5 | 10 | 6 | 0.09 | 2.82 | 1.01 | 0.05 |
| 1 | 1 | 6 | 9 | 4 | 0.07 | -2.56 | -1.54 | 0.11 |
| 1 | 1 | 7 | 5 | 8 | 0.05 | 0.49 | -2.95 | 0.14 |
| 1 | 1 | 8 | 5 | 2 | 0.09 | -1.35 | 2.67 | -0.14 |
| 1 | 1 | 9 | 9 | 10 | 0.07 | 1.42 | -2.62 | -0.14 |
| 1 | 1 | 10 | 11 | 7 | 0.09 | 2.92 | 0.63 | 0.09 |
| 1 | 1 | 11 | 5 | 11 | 0.1 | 1.51 | -2.58 | 0.06 |
Now let’s check the compression summary for HVT (torus_mapB). The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of such cells for each level.
mapB_compression_summary <- torus_mapB[[3]]$compression_summary %>% dplyr::mutate_if(is.numeric, ~ round(., 4))
compressionSummaryTable(mapB_compression_summary)
| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 11 | 9 | 0.82 | n_cells: 11 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
As seen in the table above, 82% of the cells are within the quantization error threshold. Since we have attained the desired compression percentage, we will not subdivide the cells further.
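The compression percentage reported above is, in essence, the fraction of cells whose quantization error falls below the quant.err threshold. A minimal base-R sketch of that calculation, using an illustrative vector of per-cell errors (not values taken from the map):

```r
# Fraction of cells whose quantization error is below the threshold.
# 'cell_qe' is an illustrative vector of per-cell quantization errors.
compression_pct <- function(cell_qe, quant_err = 0.1) {
  round(mean(cell_qe < quant_err), 2)
}

cell_qe <- c(0.08, 0.11, 0.05, 0.09, 0.07, 0.05, 0.09, 0.07, 0.12, 0.09, 0.04)
compression_pct(cell_qe)  # 9 of 11 cells below 0.1 -> 0.82
```

Whether a cell exactly at the threshold counts as compressed is an implementation detail of the package; the sketch above uses a strict inequality.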
Let us visualize the compressed map C in the diagram below.
Figure 13: Data segregation with a bounding box highlighted in red around compressed map C
With the novelties removed, we construct another hierarchical Voronoi tessellation, map C (layer 2), on the data without novelties (containing 9517 records), using the model parameters mentioned below.
Model Parameters
torus_mapC <- list()
mapA_scale_summary = torus_mapA[[3]]$scale_summary
torus_mapC <- trainHVT(data_without_novelty,
n_cells = 10,
depth = 2,
quant.err = 0.1,
projection.scale = 10,
normalize = FALSE,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
diagnose = FALSE,
scale_summary = mapA_scale_summary)
Now let’s check the compression summary for HVT (torus_mapC), where n_cells was set to 10. The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of such cells for each level.
mapC_compression_summary <- torus_mapC[[3]]$compression_summary %>% dplyr::mutate_if(is.numeric, ~ round(., 4))
compressionSummaryTable(mapC_compression_summary)
| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 10 | 0 | 0 | n_cells: 10 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
| 2 | 100 | 0 | 0 | n_cells: 10 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
As seen in the table above, 0% of the cells are within the quantization error threshold at level 1 and 0% at level 2.
Since we have yet to achieve at least 80% compression at depth 2, let’s try to compress again using the model parameters mentioned below and the data without novelties (containing 9517 records).
Model Parameters
torus_mapC <- list()
torus_mapC <- trainHVT(data_without_novelty,
n_cells = 30,
depth = 2,
quant.err = 0.1,
projection.scale = 10,
normalize = FALSE,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
diagnose = FALSE,
scale_summary = mapA_scale_summary)
The datatable displayed below is the summary from map C (layer 2), showing the Cell.ID, centroids, and quantization error for each of the 928 cells.
summaryTable(torus_mapC[[3]]$summary, scroll = TRUE, limit = 100)
| Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | x | y | z |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 426 | 613 | 0.5 | 0.84 | -0.88 | 0.56 |
| 1 | 1 | 2 | 363 | 590 | 0.48 | 0.96 | 1.32 | -0.85 |
| 1 | 1 | 3 | 321 | 236 | 0.6 | -0.94 | 1.78 | -0.88 |
| 1 | 1 | 4 | 311 | 628 | 0.67 | 0.48 | -2.29 | 0.78 |
| 1 | 1 | 5 | 286 | 819 | 0.56 | 2.40 | 0.88 | -0.62 |
| 1 | 1 | 6 | 424 | 584 | 0.49 | 0.92 | 0.51 | 0.26 |
| 1 | 1 | 7 | 269 | 727 | 0.58 | 1.50 | 2.39 | 0.05 |
| 1 | 1 | 8 | 261 | 55 | 0.54 | -2.31 | 1.42 | -0.46 |
| 1 | 1 | 9 | 269 | 886 | 0.51 | 2.68 | -0.71 | -0.34 |
| 1 | 1 | 10 | 250 | 122 | 0.54 | -2.32 | -1.17 | 0.62 |
| 1 | 1 | 11 | 358 | 261 | 0.45 | -1.32 | -0.25 | 0.67 |
| 1 | 1 | 12 | 373 | 685 | 0.47 | 1.34 | -0.30 | -0.72 |
| 1 | 1 | 13 | 399 | 430 | 0.44 | -0.10 | 1.02 | -0.20 |
| 1 | 1 | 14 | 350 | 244 | 0.59 | -1.45 | -1.39 | -0.89 |
| 1 | 1 | 15 | 353 | 287 | 0.51 | -0.97 | 0.85 | 0.62 |
| 1 | 1 | 16 | 282 | 386 | 0.43 | -0.74 | -1.55 | 0.88 |
| 1 | 1 | 17 | 253 | 89 | 0.55 | -2.39 | 0.56 | 0.76 |
| 1 | 1 | 18 | 287 | 728 | 0.58 | 1.75 | 1.29 | 0.86 |
| 1 | 1 | 19 | 263 | 777 | 0.62 | 1.50 | -1.82 | -0.81 |
| 1 | 1 | 20 | 259 | 563 | 0.57 | 0.09 | -2.61 | -0.58 |
| 1 | 1 | 21 | 242 | 210 | 0.6 | -1.46 | -2.44 | 0.04 |
| 1 | 1 | 22 | 384 | 443 | 0.54 | 0.13 | 1.87 | 0.87 |
| 1 | 1 | 23 | 293 | 147 | 0.57 | -1.41 | 2.13 | 0.66 |
| 1 | 1 | 24 | 302 | 807 | 0.6 | 2.24 | -0.01 | 0.82 |
| 1 | 1 | 25 | 341 | 538 | 0.53 | 0.27 | -1.23 | -0.64 |
| 1 | 1 | 26 | 266 | 391 | 0.56 | -0.03 | 2.71 | -0.39 |
| 1 | 1 | 27 | 255 | 839 | 0.64 | 1.99 | -1.75 | 0.55 |
| 1 | 1 | 28 | 378 | 406 | 0.48 | -0.57 | -0.83 | -0.03 |
| 1 | 1 | 29 | 265 | 77 | 0.54 | -2.58 | -0.30 | -0.61 |
| 1 | 1 | 30 | 434 | 250 | 0.5 | -1.30 | 0.19 | -0.67 |
| 2 | 1 | 1 | 18 | 626 | 0.12 | 1.01 | -0.31 | 0.33 |
| 2 | 1 | 2 | 12 | 710 | 0.05 | 1.44 | -0.76 | 0.93 |
| 2 | 1 | 3 | 13 | 573 | 0.07 | 0.62 | -0.86 | 0.35 |
| 2 | 1 | 4 | 15 | 574 | 0.07 | 0.54 | -1.25 | 0.77 |
| 2 | 1 | 5 | 13 | 612 | 0.08 | 0.87 | -0.68 | 0.45 |
| 2 | 1 | 6 | 17 | 567 | 0.07 | 0.61 | -0.80 | 0.07 |
| 2 | 1 | 7 | 12 | 701 | 0.08 | 1.42 | -0.53 | 0.87 |
| 2 | 1 | 8 | 13 | 634 | 0.05 | 1.03 | -0.49 | 0.51 |
| 2 | 1 | 9 | 11 | 539 | 0.08 | 0.29 | -1.38 | 0.81 |
| 2 | 1 | 10 | 19 | 601 | 0.09 | 0.79 | -0.90 | 0.59 |
| 2 | 1 | 11 | 17 | 609 | 0.07 | 0.90 | -0.52 | 0.27 |
| 2 | 1 | 12 | 8 | 744 | 0.06 | 1.63 | -0.87 | 0.99 |
| 2 | 1 | 13 | 11 | 671 | 0.09 | 1.30 | -0.32 | 0.74 |
| 2 | 1 | 14 | 21 | 542 | 0.07 | 0.42 | -0.92 | 0.15 |
| 2 | 1 | 15 | 13 | 591 | 0.08 | 0.80 | -0.60 | -0.02 |
| 2 | 1 | 16 | 19 | 650 | 0.09 | 0.99 | -1.02 | 0.81 |
| 2 | 1 | 17 | 19 | 611 | 0.11 | 0.66 | -1.41 | 0.89 |
| 2 | 1 | 18 | 16 | 674 | 0.11 | 1.04 | -1.42 | 0.97 |
| 2 | 1 | 19 | 11 | 707 | 0.08 | 1.25 | -1.27 | 0.97 |
| 2 | 1 | 20 | 13 | 646 | 0.07 | 1.06 | -0.69 | 0.68 |
| 2 | 1 | 21 | 17 | 557 | 0.11 | 0.51 | -1.02 | 0.51 |
| 2 | 1 | 22 | 7 | 589 | 0.04 | 0.75 | -0.68 | -0.15 |
| 2 | 1 | 23 | 21 | 525 | 0.1 | 0.25 | -1.19 | 0.63 |
| 2 | 1 | 24 | 9 | 677 | 0.07 | 1.24 | -0.80 | 0.85 |
| 2 | 1 | 25 | 11 | 726 | 0.06 | 1.45 | -1.11 | 0.98 |
| 2 | 1 | 26 | 10 | 592 | 0.07 | 0.79 | -0.65 | 0.22 |
| 2 | 1 | 27 | 17 | 623 | 0.08 | 0.80 | -1.17 | 0.81 |
| 2 | 1 | 28 | 13 | 662 | 0.08 | 1.19 | -0.44 | 0.68 |
| 2 | 1 | 29 | 20 | 524 | 0.09 | 0.24 | -1.02 | 0.29 |
| 2 | 1 | 30 | 10 | 608 | 0.06 | 0.93 | -0.36 | 0.07 |
| 2 | 2 | 1 | 8 | 667 | 0.06 | 1.34 | 0.81 | -0.90 |
| 2 | 2 | 2 | 26 | 512 | 0.1 | 0.49 | 1.26 | -0.75 |
| 2 | 2 | 3 | 10 | 653 | 0.09 | 1.24 | 1.35 | -0.98 |
| 2 | 2 | 4 | 15 | 587 | 0.08 | 0.94 | 0.75 | -0.60 |
| 2 | 2 | 5 | 7 | 621 | 0.06 | 1.08 | 0.57 | -0.63 |
| 2 | 2 | 6 | 17 | 739 | 0.12 | 1.72 | 1.66 | -0.92 |
| 2 | 2 | 7 | 5 | 541 | 0.05 | 0.70 | 2.20 | -0.95 |
| 2 | 2 | 8 | 13 | 630 | 0.07 | 1.15 | 0.99 | -0.87 |
| 2 | 2 | 9 | 9 | 665 | 0.06 | 1.33 | 1.09 | -0.96 |
| 2 | 2 | 10 | 9 | 514 | 0.07 | 0.51 | 1.91 | -1.00 |
| 2 | 2 | 11 | 12 | 599 | 0.07 | 1.00 | 1.56 | -0.99 |
| 2 | 2 | 12 | 16 | 513 | 0.09 | 0.50 | 1.53 | -0.92 |
| 2 | 2 | 13 | 11 | 669 | 0.1 | 1.34 | 0.65 | -0.86 |
| 2 | 2 | 14 | 15 | 588 | 0.06 | 0.95 | 1.05 | -0.81 |
| 2 | 2 | 15 | 15 | 529 | 0.11 | 0.61 | 0.97 | -0.51 |
| 2 | 2 | 16 | 11 | 679 | 0.09 | 1.36 | 1.98 | -0.91 |
| 2 | 2 | 17 | 13 | 614 | 0.1 | 1.01 | 1.85 | -0.99 |
| 2 | 2 | 18 | 13 | 691 | 0.09 | 1.43 | 1.60 | -0.98 |
| 2 | 2 | 19 | 17 | 553 | 0.1 | 0.77 | 0.87 | -0.54 |
| 2 | 2 | 20 | 10 | 631 | 0.06 | 1.07 | 2.07 | -0.94 |
| 2 | 2 | 21 | 7 | 615 | 0.05 | 0.98 | 2.27 | -0.88 |
| 2 | 2 | 22 | 12 | 712 | 0.09 | 1.62 | 1.19 | -0.99 |
| 2 | 2 | 23 | 14 | 555 | 0.06 | 0.79 | 1.42 | -0.92 |
| 2 | 2 | 24 | 9 | 463 | 0.09 | 0.25 | 1.81 | -0.98 |
| 2 | 2 | 25 | 10 | 705 | 0.1 | 1.56 | 0.80 | -0.97 |
| 2 | 2 | 26 | 8 | 625 | 0.06 | 1.11 | 0.73 | -0.74 |
| 2 | 2 | 27 | 9 | 467 | 0.08 | 0.22 | 1.47 | -0.86 |
| 2 | 2 | 28 | 13 | 546 | 0.07 | 0.72 | 1.12 | -0.74 |
| 2 | 2 | 29 | 16 | 581 | 0.09 | 0.90 | 1.25 | -0.88 |
| 2 | 2 | 30 | 13 | 550 | 0.07 | 0.75 | 1.64 | -0.98 |
| 2 | 3 | 1 | 13 | 196 | 0.11 | -0.95 | 2.30 | -0.87 |
| 2 | 3 | 2 | 14 | 235 | 0.07 | -0.95 | 1.66 | -0.99 |
| 2 | 3 | 3 | 10 | 338 | 0.07 | -0.39 | 2.12 | -0.99 |
| 2 | 3 | 4 | 11 | 328 | 0.07 | -0.53 | 1.71 | -0.98 |
| 2 | 3 | 5 | 13 | 141 | 0.13 | -1.21 | 2.56 | -0.54 |
| 2 | 3 | 6 | 13 | 166 | 0.07 | -1.52 | 1.40 | -1.00 |
| 2 | 3 | 7 | 10 | 399 | 0.08 | -0.14 | 1.63 | -0.93 |
| 2 | 3 | 8 | 5 | 260 | 0.05 | -0.61 | 2.35 | -0.90 |
| 2 | 3 | 9 | 7 | 217 | 0.05 | -0.79 | 2.41 | -0.84 |
| 2 | 3 | 10 | 10 | 389 | 0.1 | -0.16 | 1.96 | -1.00 |
Now let’s check the compression summary for HVT (torus_mapC). The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of such cells for each level.
mapC_compression_summary <- torus_mapC[[3]]$compression_summary %>% dplyr::mutate_if(is.numeric, ~ round(., 4))
compressionSummaryTable(mapC_compression_summary)
| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 30 | 0 | 0 | n_cells: 30 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
| 2 | 898 | 739 | 0.82 | n_cells: 30 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
As seen in the table above, 0% of the cells are within the quantization error threshold at level 1, while 82% are within it at level 2.
Let’s plot the Voronoi tessellation for layer 2 (map C).
plotHVT(torus_mapC,
line.width = c(0.4,0.2),
color.vec = c("#141B41","#0582CA"),
centroid.size = 0.1,
maxDepth = 2,
heatmap = '2Dhvt')
Figure 14: The Voronoi Tessellation for layer 2 (map C), shown for the 928 cells in the ’torus’ dataset at level 2
Heatmaps
Now let’s plot all the features for each cell at level two as a heatmap for better visualization.
The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus dataset, allowing us to observe patterns and trends in the distribution of each feature (x, y, z). Green shades highlight regions with higher values in each heatmap, while indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insight into the variations of, and relationships between, these features within the torus dataset.
plotHVT(
torus_mapC,
torus_train,
child.level = 2,
hmap.cols = "x",
line.width = c(0.6,0.4),
color.vec = c("#141B41","#0582CA"),
palette.color = 6,
centroid.size = 0.1,
show.points = TRUE,
quant.error.hmap = 0.2,
heatmap = '2Dheatmap'
)
Figure 15: The Voronoi Tessellation with the heat map overlaid for
feature x in the ’torus’ dataset
plotHVT(
torus_mapC,
torus_train,
child.level = 2,
hmap.cols = "y",
line.width = c(0.6,0.4),
color.vec = c("#141B41","#0582CA"),
palette.color = 6,
centroid.size = 0.1,
show.points = TRUE,
quant.error.hmap = 0.2,
heatmap = '2Dheatmap'
)
Figure 16: The Voronoi Tessellation with the heat map overlaid for
feature y in the ’torus’ dataset
plotHVT(
torus_mapC,
torus_train,
child.level = 2,
hmap.cols = "z",
line.width = c(0.6,0.4),
color.vec = c("#141B41","#0582CA"),
palette.color = 6,
centroid.size = 0.1,
show.points = TRUE,
quant.error.hmap = 0.2,
heatmap = '2Dheatmap'
)
Figure 17: The Voronoi Tessellation with the heat map overlaid for
feature z in the ’torus’ dataset
We now have the set of maps (map A, map B, and map C) that will be used for scoring, i.e., determining which map and cell each test record is assigned to.
Now that we have built the model, let us score our testing dataset (containing 2400 data points) to determine which cell and layer each point belongs to.
The scoreLayeredHVT function scores the testing dataset using the set of maps; it takes as input a testing dataset and the set of maps (map A, map B, map C).
Let us now understand the scoreLayeredHVT function.
scoreLayeredHVT(data,
map_A,
map_B,
map_C,
mad.threshold = 0.2,
normalize = TRUE,
distance_metric="L1_Norm",
error_metric="max",
child.level = 1,
line.width = c(0.6, 0.4, 0.2),
color.vec = c("#141B41", "#6369D1", "#D8D2E1"),
yVar= NULL,
...)
Each parameter of the scoreLayeredHVT function is explained below:
data - A dataframe containing the test dataset. The dataframe should have all the variables (features) used for training.
map A - The result obtained from the trainHVT function while performing hierarchical vector quantization on the training data. This list contains information about the hierarchical vector quantized data along with a summary section.
map B - The result obtained from the trainHVT function while performing hierarchical vector quantization on the novelty data. The novelty data is a subset of the training data, returned as the 1st element of the removeNovelty function’s output.
map C - The result obtained from the trainHVT function while performing hierarchical vector quantization on the data without novelties. This data is a subset of the training data, returned as the 2nd element of the removeNovelty function’s output.
child.level - A number indicating the depth for which the heat map is to be plotted (only used if hmap.cols is not NULL). Each depth represents a different level of clustering or partitioning of the data.
mad.threshold - A threshold value used to filter data based on the Median Absolute Deviation (MAD) of the Quant.Error variable. It determines how extreme a deviation from the median has to be for a record to be considered a novelty.
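One plausible reading of this filtering, sketched in base R with illustrative Quant.Error values (this is a conceptual illustration, not the package’s internal code): compute the median of the errors and flag cells that deviate from it by more than the threshold.

```r
# Flag cells whose Quant.Error deviates from the median by more than
# mad.threshold. 'qe' is an illustrative vector of Quant.Error values.
qe <- c(0.05, 0.06, 0.07, 0.06, 0.05, 0.9)
med <- median(qe)
mad_qe <- median(abs(qe - med))      # MAD of the errors (0.01 here)
flagged <- which(abs(qe - med) > 0.2)  # mad.threshold = 0.2
flagged  # only the 6th cell deviates strongly from the median
```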
normalize - A logical value indicating whether the dataset should be normalized. When set to TRUE, the testing dataset is standardized using the mean and standard deviation of the training dataset obtained from trainHVT(). When set to FALSE, the data is used as is, without any changes.
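The key point is that the training statistics, not the test statistics, are used for standardization. A sketch of the idea in base R (not the package’s internal code):

```r
# Standardize test data using the TRAINING mean and sd for each feature,
# mirroring what normalize = TRUE does conceptually.
train <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
test  <- data.frame(x = c(2.5, 3.5),   y = c(15, 35))

train_mean <- sapply(train, mean)
train_sd   <- sapply(train, sd)

test_scaled <- as.data.frame(
  scale(test, center = train_mean, scale = train_sd)
)
test_scaled$x  # test x standardized with the training mean/sd
```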
distance_metric - The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between an n-dimensional point and a centroid. The distance metric can be different from the one used during training.
error_metric - The error metric can be mean or max; max is selected by default. max returns the maximum of m values and mean returns the mean of m values, where each value is the distance between a point and the centroid of its cell. The error metric can be different from the one used during training.
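The interplay of distance_metric and error_metric can be sketched in base R (a conceptual illustration, not the package internals): each of the m points in a cell gets a distance to the centroid under the chosen norm, and the cell-level error aggregates those m distances with max or mean.

```r
# Distance of each point in a cell to its centroid, under L1 or L2 norm,
# then the cell-level error as either the max or the mean of those distances.
cell_error <- function(points, centroid,
                       distance_metric = c("L1_Norm", "L2_Norm"),
                       error_metric = c("max", "mean")) {
  distance_metric <- match.arg(distance_metric)
  error_metric <- match.arg(error_metric)
  d <- apply(points, 1, function(p) {
    if (distance_metric == "L1_Norm") sum(abs(p - centroid))
    else sqrt(sum((p - centroid)^2))
  })
  if (error_metric == "max") max(d) else mean(d)
}

pts <- rbind(c(1, 0, 0), c(0, 2, 0), c(0, 0, 3))  # 3 points in one cell
ctr <- c(0, 0, 0)
cell_error(pts, ctr, "L1_Norm", "max")   # 3
cell_error(pts, ctr, "L1_Norm", "mean")  # 2
```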
yVar - A character or a vector
representing the name of the dependent variable(s)
line.width - A vector indicating
the line widths of the tessellation boundaries for each layer. (Optional
Parameters)
color.vec - A vector indicating the
colors of the tessellations boundaries at each layer. (Optional
Parameters)
The function scores each test record against the HVT maps - map A, map B, and map C - constructed using the trainHVT function. For each test record, the function assigns the record to Layer 1 or Layer 2. Layer 1 contains the cell IDs from map A, and Layer 2 contains the cell IDs from map B (the novelty map) and map C (the map without novelties).
Scoring Algorithm
The scoring algorithm recursively calculates the distance between each point in the testing dataset and the cell centroids at each level. The following steps explain the scoring method for a single point in the test dataset:
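Conceptually, the assignment at a single level amounts to finding the nearest cell centroid under the chosen distance metric. A simplified single-level sketch in base R (the actual algorithm recurses over levels and maps; the centroids below are illustrative, loosely based on map B’s summary):

```r
# Assign a point to the nearest centroid (L1 norm), returning the cell index.
assign_cell <- function(point, centroids) {
  d <- apply(centroids, 1, function(ctr) sum(abs(point - ctr)))
  which.min(d)
}

centroids <- rbind(c(0.68, -2.91, 0.15),
                   c(-2.78, -1.10, -0.05),
                   c(2.75, 1.17, 0.11))
assign_cell(c(2.9, 1.0, 0.1), centroids)  # 3 (closest to the third centroid)
```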
Note: The scoring algorithm will not work if any of the variables used to perform quantization are missing. No features should be removed from the testing dataset.
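Before scoring, it is worth verifying that the test dataset carries every feature used in training. A minimal base-R check (the feature names here are those of the torus data):

```r
# Ensure the test data contains every feature the model was trained on.
train_features <- c("x", "y", "z")
test_data <- data.frame(x = 1.2, y = -0.5, z = 0.3)

missing <- setdiff(train_features, colnames(test_data))
if (length(missing) > 0) {
  stop("Missing features in test data: ", paste(missing, collapse = ", "))
}
```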
validation_data <- torus_test
new_score <- scoreLayeredHVT(
data=validation_data,
torus_mapA,
torus_mapB,
torus_mapC,
normalize = FALSE
)
Let’s see which cell and layer each point belongs to and check the mean absolute difference for each of the 2400 records. For the sake of brevity, we display only the first 100 rows.
act_pred <- new_score[["actual_predictedTable"]]
rownames(act_pred) <- NULL
act_pred %>% head(100) %>% as.data.frame() %>% Table(scroll = TRUE)
| Row.Number | act_x | act_y | act_z | Layer1.Cell.ID | Layer2.Cell.ID | pred_x | pred_y | pred_z | diff |
|---|---|---|---|---|---|---|---|---|---|
| 1 | -2.6282 | 0.5656 | -0.7253 | A426 | C77 | -2.5813521 | -0.2999468 | -0.6123004 | 0.3417981 |
| 2 | 2.7471 | -0.9987 | -0.3848 | A383 | C886 | 2.6772045 | -0.7129922 | -0.3361825 | 0.1347403 |
| 3 | -2.4446 | -1.6528 | 0.3097 | A43 | C122 | -2.3163828 | -1.1699452 | 0.6168016 | 0.3060579 |
| 4 | -2.6487 | -0.5745 | 0.7040 | A137 | C122 | -2.3163828 | -1.1699452 | 0.6168016 | 0.3383203 |
| 5 | -0.2676 | -1.0800 | -0.4611 | A280 | C538 | 0.2727044 | -1.2341352 | -0.6402883 | 0.2912093 |
| 6 | -1.1130 | -0.6516 | -0.7040 | A302 | C250 | -1.3009074 | 0.1932009 | -0.6693836 | 0.3557749 |
| 7 | 2.0288 | 1.9519 | 0.5790 | A872 | C728 | 1.7540533 | 1.2884739 | 0.8592087 | 0.4061272 |
| 8 | -2.4799 | 1.6863 | -0.0470 | A706 | C55 | -2.3105632 | 1.4238165 | -0.4643663 | 0.2830622 |
| 9 | -0.4105 | -1.1610 | -0.6398 | A254 | C538 | 0.2727044 | -1.2341352 | -0.6402883 | 0.2522760 |
| 10 | -0.2545 | -1.6160 | -0.9314 | A177 | C538 | 0.2727044 | -1.2341352 | -0.6402883 | 0.4000603 |
| 11 | 1.1500 | 0.3945 | -0.6205 | A551 | C685 | 1.3422137 | -0.3020255 | -0.7171627 | 0.3284673 |
| 12 | -1.2557 | -1.1369 | 0.9520 | A179 | C386 | -0.7375099 | -1.5507936 | 0.8778894 | 0.3353981 |
| 13 | -0.5449 | -2.6892 | -0.6684 | A28 | C563 | 0.0872637 | -2.6105780 | -0.5796456 | 0.2665134 |
| 14 | 2.9093 | 0.7222 | -0.0697 | A800 | B7 | 2.9212455 | 0.6341636 | 0.0878182 | 0.0858333 |
| 15 | 2.3205 | 1.2520 | -0.7711 | A827 | C819 | 2.4028203 | 0.8791671 | -0.6244483 | 0.2006016 |
| 16 | 1.4772 | -0.5194 | -0.9008 | A461 | C685 | 1.3422137 | -0.3020255 | -0.7171627 | 0.1786660 |
| 17 | -1.3176 | -2.6541 | 0.2690 | A3 | C210 | -1.4626517 | -2.4376124 | 0.0398723 | 0.1968890 |
| 18 | 1.0687 | 0.1211 | -0.3812 | A513 | C685 | 1.3422137 | -0.3020255 | -0.7171627 | 0.3442006 |
| 19 | -0.9632 | 0.3283 | -0.1866 | A463 | C250 | -1.3009074 | 0.1932009 | -0.6693836 | 0.3185300 |
| 20 | 2.5616 | 0.4634 | 0.7976 | A761 | C807 | 2.2367156 | -0.0082026 | 0.8223858 | 0.2737576 |
| 21 | 2.8473 | -0.9303 | -0.0955 | A389 | C886 | 2.6772045 | -0.7129922 | -0.3361825 | 0.2093620 |
| 22 | -0.5293 | -0.8566 | 0.1173 | A320 | C406 | -0.5681405 | -0.8276460 | -0.0344534 | 0.0731826 |
| 23 | -1.9898 | -2.1766 | 0.3150 | A4 | C210 | -1.4626517 | -2.4376124 | 0.0398723 | 0.3544295 |
| 24 | -0.8845 | -1.2219 | -0.8709 | A243 | C244 | -1.4496154 | -1.3855974 | -0.8882494 | 0.2487208 |
| 25 | 0.1553 | 2.2566 | 0.9651 | A791 | C443 | 0.1333753 | 1.8734750 | 0.8721690 | 0.1659936 |
| 26 | 2.4262 | -0.6069 | -0.8655 | A459 | C886 | 2.6772045 | -0.7129922 | -0.3361825 | 0.2954714 |
| 27 | -0.0667 | -1.4627 | -0.8444 | A225 | C538 | 0.2727044 | -1.2341352 | -0.6402883 | 0.2573603 |
| 28 | -0.0655 | -1.3311 | -0.7448 | A268 | C538 | 0.2727044 | -1.2341352 | -0.6402883 | 0.1798936 |
| 29 | 1.9592 | 1.5104 | 0.8806 | A804 | C728 | 1.7540533 | 1.2884739 | 0.8592087 | 0.1494880 |
| 30 | 1.2332 | 2.5452 | 0.5603 | A865 | C727 | 1.5048349 | 2.3899870 | 0.0467405 | 0.3134691 |
| 31 | -0.8720 | 0.4903 | 0.0287 | A483 | C287 | -0.9738620 | 0.8474068 | 0.6197989 | 0.3500226 |
| 32 | 0.2194 | -1.7686 | 0.9760 | A159 | C628 | 0.4814534 | -2.2922875 | 0.7817855 | 0.3266518 |
| 33 | 1.5052 | 0.0445 | -0.8694 | A532 | C685 | 1.3422137 | -0.3020255 | -0.7171627 | 0.2205830 |
| 34 | -2.8410 | -0.8651 | 0.2439 | A103 | C122 | -2.3163828 | -1.1699452 | 0.6168016 | 0.4007880 |
| 35 | 1.3203 | -2.5967 | 0.4077 | A63 | C628 | 0.4814534 | -2.2922875 | 0.7817855 | 0.5057816 |
| 36 | -1.5648 | 1.5577 | 0.9781 | A650 | C147 | -1.4121276 | 2.1263891 | 0.6619017 | 0.3458532 |
| 37 | 0.3589 | -1.0419 | -0.4400 | A340 | C538 | 0.2727044 | -1.2341352 | -0.6402883 | 0.1595730 |
| 38 | -0.2900 | -2.0106 | 0.9995 | A130 | C386 | -0.7375099 | -1.5507936 | 0.8778894 | 0.3429757 |
| 39 | 0.5300 | 1.3668 | 0.8455 | A698 | C443 | 0.1333753 | 1.8734750 | 0.8721690 | 0.3099896 |
| 40 | 1.0254 | -0.6738 | 0.6344 | A409 | C613 | 0.8381096 | -0.8753467 | 0.5607136 | 0.1541745 |
| 41 | -0.9306 | 0.3664 | 0.0154 | A483 | C287 | -0.9738620 | 0.8474068 | 0.6197989 | 0.3762226 |
| 42 | 2.3888 | -1.0670 | 0.7875 | A411 | C807 | 2.2367156 | -0.0082026 | 0.8223858 | 0.4152558 |
| 43 | -0.9830 | -0.2043 | -0.0897 | A408 | C406 | -0.5681405 | -0.8276460 | -0.0344534 | 0.3644840 |
| 44 | 0.9499 | 0.3135 | 0.0261 | A541 | C584 | 0.9179814 | 0.5079842 | 0.2631802 | 0.1544943 |
| 45 | -1.8079 | -1.4936 | 0.9386 | A127 | C122 | -2.3163828 | -1.1699452 | 0.6168016 | 0.3846453 |
| 46 | 1.8399 | -1.9295 | -0.7459 | A160 | C777 | 1.4974529 | -1.8159289 | -0.8105544 | 0.1735575 |
| 47 | -0.3304 | -1.8481 | 0.9925 | A125 | C386 | -0.7375099 | -1.5507936 | 0.8778894 | 0.2730090 |
| 48 | -2.2806 | -1.8984 | 0.2536 | A15 | C122 | -2.3163828 | -1.1699452 | 0.6168016 | 0.3758131 |
| 49 | -2.3323 | 1.7320 | 0.4252 | A739 | C55 | -2.3105632 | 1.4238165 | -0.4643663 | 0.4064955 |
| 50 | 0.5520 | 0.8441 | 0.1308 | A593 | C584 | 0.9179814 | 0.5079842 | 0.2631802 | 0.2781591 |
| 51 | -0.9449 | 2.2273 | 0.9078 | A755 | C147 | -1.4121276 | 2.1263891 | 0.6619017 | 0.2713456 |
| 52 | 0.2334 | -1.4612 | -0.8540 | A214 | C538 | 0.2727044 | -1.2341352 | -0.6402883 | 0.1600270 |
| 53 | 2.7387 | 0.9703 | 0.4244 | A817 | C819 | 2.4028203 | 0.8791671 | -0.6244483 | 0.4919536 |
| 54 | 0.3561 | 1.1619 | -0.6199 | A645 | C590 | 0.9634804 | 1.3193923 | -0.8514567 | 0.3321432 |
| 55 | 1.7006 | 1.5569 | -0.9522 | A808 | C590 | 0.9634804 | 1.3193923 | -0.8514567 | 0.3584568 |
| 56 | 1.7244 | -0.5698 | 0.9829 | A467 | C807 | 2.2367156 | -0.0082026 | 0.8223858 | 0.4114757 |
| 57 | 0.9922 | 1.1438 | -0.8741 | A713 | C590 | 0.9634804 | 1.3193923 | -0.8514567 | 0.0756517 |
| 58 | -0.3022 | -1.3611 | 0.7956 | A227 | C386 | -0.7375099 | -1.5507936 | 0.8778894 | 0.2357643 |
| 59 | -0.9693 | 1.0602 | 0.8261 | A542 | C287 | -0.9738620 | 0.8474068 | 0.6197989 | 0.1412188 |
| 60 | 1.1313 | -0.3595 | -0.5824 | A485 | C685 | 1.3422137 | -0.3020255 | -0.7171627 | 0.1343836 |
| 61 | -0.7561 | -2.5384 | -0.7611 | A60 | C563 | 0.0872637 | -2.6105780 | -0.5796456 | 0.3656654 |
| 62 | 2.3168 | 1.8924 | 0.1302 | A892 | C727 | 1.5048349 | 2.3899870 | 0.0467405 | 0.4643372 |
| 63 | 1.2363 | -2.6444 | -0.3939 | A56 | C563 | 0.0872637 | -2.6105780 | -0.5796456 | 0.4562013 |
| 64 | -1.3204 | -0.6281 | 0.8430 | A260 | C261 | -1.3167277 | -0.2491240 | 0.6686894 | 0.1856530 |
| 65 | 1.3733 | 1.1877 | 0.9829 | A716 | C728 | 1.7540533 | 1.2884739 | 0.8592087 | 0.2017395 |
| 66 | 1.0874 | -0.1278 | 0.4251 | A511 | C584 | 0.9179814 | 0.5079842 | 0.2631802 | 0.3223742 |
| 67 | 2.1300 | -1.2171 | -0.8914 | A335 | C777 | 1.4974529 | -1.8159289 | -0.8105544 | 0.4374072 |
| 68 | 1.6863 | -0.5945 | 0.9773 | A467 | C807 | 2.2367156 | -0.0082026 | 0.8223858 | 0.4305424 |
| 69 | 0.8504 | 1.0927 | -0.7882 | A681 | C590 | 0.9634804 | 1.3193923 | -0.8514567 | 0.1343432 |
| 70 | 0.3029 | 1.0731 | 0.4656 | A630 | C430 | -0.0971827 | 1.0248170 | -0.1998627 | 0.3712761 |
| 71 | -1.4724 | 1.1331 | 0.9899 | A567 | C287 | -0.9738620 | 0.8474068 | 0.6197989 | 0.3847774 |
| 72 | -0.5452 | -1.2243 | 0.7514 | A223 | C386 | -0.7375099 | -1.5507936 | 0.8778894 | 0.2150976 |
| 73 | -1.6866 | 2.1137 | 0.7101 | A763 | C147 | -1.4121276 | 2.1263891 | 0.6619017 | 0.1117866 |
| 74 | 1.2012 | -2.0386 | -0.9305 | A163 | C777 | 1.4974529 | -1.8159289 | -0.8105544 | 0.2129565 |
| 75 | -0.2108 | 2.3579 | 0.9301 | A791 | C443 | 0.1333753 | 1.8734750 | 0.8721690 | 0.2955104 |
| 76 | -0.5982 | 1.3776 | -0.8671 | A656 | C236 | -0.9350442 | 1.7816903 | -0.8837452 | 0.2525266 |
| 77 | -0.2116 | -1.0573 | -0.3878 | A303 | C538 | 0.2727044 | -1.2341352 | -0.6402883 | 0.3045426 |
| 78 | -0.7802 | -0.9000 | -0.5880 | A275 | C406 | -0.5681405 | -0.8276460 | -0.0344534 | 0.2793200 |
| 79 | 1.0850 | -1.6815 | 1.0000 | A182 | C839 | 1.9894376 | -1.7503337 | 0.5523647 | 0.4736356 |
| 80 | 1.5563 | 0.1715 | -0.9008 | A617 | C685 | 1.3422137 | -0.3020255 | -0.7171627 | 0.2904164 |
| 81 | -0.3790 | 1.4273 | 0.8522 | A652 | C443 | 0.1333753 | 1.8734750 | 0.8721690 | 0.3261731 |
| 82 | -1.2769 | -0.2633 | 0.7178 | A347 | C261 | -1.3167277 | -0.2491240 | 0.6686894 | 0.0343714 |
| 83 | -1.6039 | 2.4566 | 0.3575 | A798 | C147 | -1.4121276 | 2.1263891 | 0.6619017 | 0.2754617 |
| 84 | -0.9297 | 2.4281 | -0.8000 | A797 | C236 | -0.9350442 | 1.7816903 | -0.8837452 | 0.2451664 |
| 85 | 0.5324 | -0.8526 | 0.1016 | A376 | C613 | 0.8381096 | -0.8753467 | 0.5607136 | 0.2625233 |
| 86 | 0.3928 | 1.5433 | -0.9132 | A722 | C590 | 0.9634804 | 1.3193923 | -0.8514567 | 0.2854438 |
| 87 | 1.0031 | 0.3850 | -0.3786 | A543 | C584 | 0.9179814 | 0.5079842 | 0.2631802 | 0.2832943 |
| 88 | -0.7562 | 0.7889 | -0.4207 | A536 | C430 | -0.0971827 | 1.0248170 | -0.1998627 | 0.3719239 |
| 89 | -1.0870 | -0.7523 | -0.7350 | A302 | C244 | -1.4496154 | -1.3855974 | -0.8882494 | 0.3830541 |
| 90 | -1.8671 | -0.8423 | -0.9988 | A199 | C244 | -1.4496154 | -1.3855974 | -0.8882494 | 0.3571109 |
| 91 | 0.8325 | -0.9413 | 0.6689 | A351 | C613 | 0.8381096 | -0.8753467 | 0.5607136 | 0.0599164 |
| 92 | -0.3355 | 0.9636 | 0.2005 | A574 | C430 | -0.0971827 | 1.0248170 | -0.1998627 | 0.2332990 |
| 93 | -1.0089 | -0.6007 | 0.5639 | A296 | C261 | -1.3167277 | -0.2491240 | 0.6686894 | 0.2547310 |
| 94 | 1.7725 | 1.7153 | -0.8845 | A833 | C590 | 0.9634804 | 1.3193923 | -0.8514567 | 0.4126568 |
| 95 | 0.5539 | -0.8888 | 0.3037 | A360 | C613 | 0.8381096 | -0.8753467 | 0.5607136 | 0.1848922 |
| 96 | 0.8149 | -2.6016 | 0.6874 | A73 | C628 | 0.4814534 | -2.2922875 | 0.7817855 | 0.2457149 |
| 97 | 0.1104 | 1.7654 | -0.9729 | A757 | C236 | -0.9350442 | 1.7816903 | -0.8837452 | 0.3836298 |
| 98 | 1.0107 | 0.3118 | 0.3349 | A537 | C584 | 0.9179814 | 0.5079842 | 0.2631802 | 0.1202075 |
| 99 | 2.2697 | -0.3642 | 0.9543 | A473 | C807 | 2.2367156 | -0.0082026 | 0.8223858 | 0.1736320 |
| 100 | 0.4983 | -0.8672 | -0.0185 | A376 | C613 | 0.8381096 | -0.8753467 | 0.5607136 | 0.3090567 |
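The diff column is the mean absolute difference between a record’s actual coordinates and the centroid (predicted) coordinates of its assigned cell. Recomputing it for the first row of the table above:

```r
# Mean absolute difference between actual and predicted coordinates (row 1).
actual    <- c(-2.6282, 0.5656, -0.7253)
predicted <- c(-2.5813521, -0.2999468, -0.6123004)

diff_row1 <- mean(abs(actual - predicted))
round(diff_row1, 7)  # 0.3417981, matching the diff column above
```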
hist(act_pred$diff, breaks = 30, col = "blue", main = "Mean Absolute Difference", xlab = "Difference")
Figure 22: Mean Absolute Difference
In this vignette, we considered the torus dataset for creating a scored sequence of maps using scoreLayeredHVT().
Our goal is to achieve data compression of at least 80%.
We constructed a compressed HVT map (torus_mapA) using trainHVT() on the training dataset, setting n_cells to 900 and quant.err to 0.1, and were able to attain a compression of 83%.
Based on the output of the above step, we manually identify the novelty cell(s) from the plotted map A. For this dataset, we identify 7 cells as the novelty cells. (Since the torus dataset does not have outliers, we do this for demonstration purposes.)
We pass the identified novelty cell(s) as a parameter to removeNovelty() along with the HVT torus_mapA. The function removes the novelty cell(s) from the dataset and stores them separately. It also returns the data without the novelties.
The plotNovelCells() function constructs hierarchical Voronoi tessellations and highlights the identified novelty cell(s) in red.
The data with novelties is then passed to trainHVT() to construct another HVT map (torus_mapB), this time setting the parameters n_cells = 11, depth = 1, etc. when constructing the map.
The data without novelties is then passed to trainHVT() to construct another HVT map (torus_mapC), this time setting the parameters n_cells = 30, depth = 2, etc. when constructing the map.
Finally, the set of maps - torus_mapA, torus_mapB, and torus_mapC - is passed to scoreLayeredHVT() along with the test dataset to score which map and cell each test record is assigned to.
The output of scoreLayeredHVT is a dataset with two columns, Layer1.Cell.ID and Layer2.Cell.ID. Layer1.Cell.ID contains cell IDs from map A in the form A1, A2, A3, …, and Layer2.Cell.ID contains cell IDs from map B (B1, B2, …), depending on the identified novelties, and from map C (C1, C2, C3, …).
Topology Preserving Maps : https://users.ics.aalto.fi/jhollmen/dippa/node9.html
Vector Quantization : https://en.wikipedia.org/wiki/Vector_quantization
Sammon’s Projection : https://en.wikipedia.org/wiki/Sammon_mapping
Voronoi Tessellations : https://en.wikipedia.org/wiki/Centroidal_Voronoi_tessellation